Academic Open Internet Journal ISSN 1311-4360
Volume 22, 2008
IMPROVED INFORMATION EXTRACTION FOR TEXT MINING
WITH SOFT MATCHING RULES AND EXCEPTIONS
A. CHRISTY
RESEARCH SCHOLAR
SATHYABAMA UNIVERSITY
E-mail: Christy_a1@hotmail.com
P.THAMBIDURAI
PROFESSOR
PONDICHERRY ENGINEERING COLLEGE
E-mail: ptdurai58@yahoo.com
ABSTRACT
The popularity of the Internet and the large number of documents available in electronic form have motivated the search for hidden knowledge in text collections. Although the Internet presents numerous sources of useful information, accessing and extracting their content is difficult. Information extraction (IE) software identifies and extracts relevant information from texts, pulling information from a variety of sources and aggregating it to create a single view. Information extraction approaches are of two types: natural language processing and wrapper induction. In this paper, we propose an algorithm for information extraction using an NLP technique in the form of scenario template production; extensive linguistic knowledge is not necessary for successful IE. The system is constructed using inductive learning with soft-matching rules combined with exceptions for extracting rules. The system is tested with a collection of NSF Research Award abstracts and with abstracts from two different domains of www.computer.org, and we find that its recall improves after the application of exceptions.
KEYWORDS: Parsing, Feature Set Extraction, Rule Extraction, Precision, Recall
INTRODUCTION
The purpose of text mining is to develop technologies that discover useful knowledge from enormous collections of documents and systems that deliver this knowledge to support decision making. IE translates content into a homogeneous form through technologies such as XML (eXtensible Markup Language). The target texts of opportunity for IE are those whose function is the communication of factual information. IE is not a stand-alone task that human analysts typically engage in. The goal of IE software is to transform texts composed of everyday language into a structured, database-like format. In this way, heterogeneous documents are summarized and presented in a uniform manner. To improve accuracy and ease development, IE software is usually domain or topic specific.
The two basic approaches to the design of IE systems are the knowledge engineering approach and the automatic training approach. In the knowledge engineering approach, grammars are constructed by hand and patterns are discovered by domain experts; the approach requires much laborious tuning and hill climbing. The automatic training approach, in contrast, learns rules from annotated corpora and from interaction with the user, and uses statistical methods for discovering patterns whenever possible. A knowledge engineering system is appropriate when linguistic resources such as lexicons are available, a skilled rule writer is available, training data is sparse or expensive to obtain, the last small increment of performance is critical, and the extraction specifications are likely to change slightly over time. Automatically trained systems are best deployed in complementary situations, where resources other than raw text are unavailable, training data can be easily and cheaply obtained, task specifications are stable, and absolute maximum performance is not critical.
An IE system designed to monitor technical articles about information science, for example, could pull out the names of professors, research studies, topics of interest, conferences, and forthcoming publications from press releases, news stories, or emails and encode this information in a database. End users can then search across this database by textual attribute or feature.
Bag-of-words representations, natural language processing (NLP) techniques, conditional random fields, machine learning, and patterns in the form of templates are some of the methods used for transforming documents into structured collections of concepts. In this paper, we propose a rule induction method for automatically extracting key information such as the aim, methodology and conclusion stated by authors in technical abstracts, as well as from NSF Research Award abstracts, where the information is extracted as named entities and in scenario template production form. When discovering patterns in text, strict matching of strings is inadequate because textual database entries generally exhibit variations in usage. To match strings, we use soft-matching rules in an inductive method with exceptions that integrates the rule-based and automatic training approaches.
The goal of an IE system is to find specific data in natural-language texts. The data to be extracted is typically specified by a template that lists slots to be filled with substrings taken from the document. Linguistic evidence shows that an abstract in a given domain follows a prototypical and even modular organization. From a scientific viewpoint, there are also claims that important findings could be searched for by linking this kind of information across documents [1]. It is observed that authors use a definite set of verbs for stating the predicates: for example, 'propose' is a verb commonly used to indicate the aim, and 'show' is a verb commonly used to state the conclusion. The IE task extracts information at a rhetorical and semantic level and uses it in a rule-like form to represent the document's key facts. Computationally, rule-plus-exception strategies focus on summarizing and understanding data in concise forms. Rule accuracy is not necessarily the preferred criterion for rule generation, but the system's overall accuracy can be improved by adding exceptions [4].
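As an illustration only (not part of the original system), this slot-and-verb view can be sketched in Python as a small rhetorical-role lexicon together with a scenario template; the verb lists and slot names here are assumed examples.

# Illustrative sketch of the slot-and-verb view described above.
# The verb lists are assumed examples, not the paper's actual lexicon.
ROLE_VERBS = {
    "aim": {"propose", "present", "introduce"},
    "method": {"use", "apply", "develop"},
    "conclusion": {"show", "demonstrate", "conclude"},
}

def empty_template(doc_id):
    """A scenario template whose slots are filled with substrings from the document."""
    return {"document": doc_id, "aim": None, "method": None, "conclusion": None}

def role_of(verb):
    """Map a predicate verb to its rhetorical role, if any."""
    for role, verbs in ROLE_VERBS.items():
        if verb.lower() in verbs:
            return role
    return None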
MOTIVATION AND BACKGROUND
Formally evaluating NLP-based information extraction software requires compiling a corpus of texts and manually creating an answer key. Each text is run through an IE system, which fills empty template slots with values derived from the test documents; the resulting template is then measured against an answer key specifying what information should have been extracted from that text.
The motivation for IE is that it is representative of a general challenge for learning: given a set of concepts to be recognized and labeled, how can an algorithm learn to reliably spot and label new examples of those concepts? IF-THEN rules are one of the most expressive representations for learned hypotheses, and the rules can be used to categorize previously unseen data. In general, classification rule learning methods extract rules from decision trees, adopt sequential set-covering algorithms, or translate neural nets into human-readable rules. Most of them assume that examples are represented as feature vectors whose components are either real numbers or nominal values [3]. Two widely used schemes for rule learning are C4.5rules and Ripper. Both generate concise, human-readable rule sets. C4.5rules induces rules by learning decision trees and translating them into pruned rules: the algorithm generates a rule for each path of a tree learned by C4.5 and then checks whether the rules can be generalized by dropping conditions. C4.5 can handle training data with continuous attributes or missing attribute values and has been successfully applied to a wide variety of machine-learning tasks. Ripper is a fast rule learner with the ability to handle set-valued features. Ripper, based on the incremental reduced error pruning algorithm, splits the available training data into a growing set and a pruning set before learning rules. It has been shown that Ripper scales nearly linearly with the number of examples in the training set.
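To illustrate the tree-to-rule idea, the following sketch uses scikit-learn's CART implementation and its bundled iris data purely as stand-ins for C4.5 and a real extraction corpus; it is not part of the system described here.

# Sketch of translating a learned decision tree into readable rules;
# scikit-learn's CART and the iris data stand in for C4.5 and a real corpus.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(data.data, data.target)

# Each root-to-leaf path of the printed tree reads as one IF-THEN rule,
# whose conditions could later be generalized by dropping tests.
print(export_text(tree, feature_names=list(data.feature_names)))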
ARCHITECTURE OF IE SYSTEMS
The general architecture of an IE system comprises the components depicted in Fig.1. The process starts with splitting a given sentence into tokens (words), from which stop words such as the, a, an and it are removed, as they contribute no meaning to the recognition of the key terms used for IE. Morphological and lexical processing concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language; for example, the meaning of the word proposed is derivable from the meaning of the verb propose (the stem word), and the inclusion of suffixes may transform a verb into an adverb. Morphological processing deals with the identification of the stem word, which here is a verb. Syntactic analysis concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and which phrases are subparts of other phrases. Domain analysis applies the general knowledge about the structure of the world that language users must have in order, for example, to maintain efficient knowledge discovery.
SIMILARITY METRICS
Most existing text mining techniques discover rules that require an exact match. However, due to the heterogeneity of textual data, a form of soft matching is needed to construct an effective text mining system. Soft matching requires a method to determine the distance between two textual items or documents. Similarity of text can be measured using standard bag-of-words (BOW) or edit-distance measures. The TFIDF (Term Frequency, Inverse Document Frequency) weighting scheme is used to assign higher weights to distinguishing terms in a document. TFIDF makes two assumptions about the importance of a term. First, the more a term appears in a document, the more important it is for that document (term frequency). Second, the more it appears throughout the entire collection of documents, the less important it is, since it does not characterize the particular document well (inverse document frequency). In the TFIDF framework, the weight w_ij of term t_j in document d_i is defined as follows:
w_ij = tf_ij * log2(N / n_j)
where tf_ij is the frequency of term t_j in document d_i, N is the total number of documents in the collection, and n_j is the number of documents in which term t_j occurs at least once.
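A minimal sketch of this weighting scheme is given below; it assumes documents are already tokenized and is an illustration, not the system's actual implementation.

# Minimal TF-IDF sketch following the formula above:
#   w_ij = tf_ij * log2(N / n_j)
import math
from collections import Counter

def tfidf_weights(documents):
    """documents: list of token lists; returns one {term: weight} dict per document."""
    N = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))            # n_j: documents containing term t_j
    weights = []
    for doc in documents:
        tf = Counter(doc)                    # tf_ij: raw term frequency
        weights.append({t: tf[t] * math.log2(N / doc_freq[t]) for t in tf})
    return weights

docs = [["data", "mining", "visualization"], ["data", "integration"]]
print(tfidf_weights(docs))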
INFORMATION EXTRACTION SYSTEM
To extract this initial key information from the texts, an IE module was built. Essentially, it takes a set of text documents, has them tagged by a previously trained part-of-speech (POS) tagger, and produces an intermediate representation for every document (i.e., a template in the IE sense), which is then converted into a general rule. A set of hand-crafted, domain-independent extraction patterns, as shown in Fig.2, was written and coded. Each pattern constructs an output representation that involves linguistic knowledge: the rhetorical role (aim, method, conclusion) and the action represented by the predicate relation and its arguments (partial sentences). In Fig.2, the verbs present, introduce and show are identified as predicates, and the sequence of tokens describing the sentence forms the arguments.
( 'Document 1',
  [ % Preceding information
    aim( present( 'data mining, visualization, techniques ...' ) ),
    method( introduce( 'powerful, visual, data mining, environment ...' ) )
  ],
  { % Succeeding information
    conclusion( show( 'learn, more, more, quickly, such, integrated ...' ) )
  }
)
Fig.2 Intermediate representation (filled template) for a sample document
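Purely as an illustration, the intermediate representation of Fig.2 could be held in memory as a simple structure such as the one below; the field layout is an assumption, not the module's actual data format.

# Illustrative in-memory form of the Fig.2 template: each rhetorical role
# holds its predicate verb and the partial-sentence arguments.
document_1 = {
    "aim":        ("present",   "data mining, visualization, techniques ..."),
    "method":     ("introduce", "powerful, visual, data mining, environment ..."),
    "conclusion": ("show",      "learn, more, more, quickly, such, integrated ..."),
}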
FEATURE EXTRACTION
Feature extraction plays a vital role in information extraction, since a template is filled only by the occurrence of the specified features. We have used bagging, which works by training each classifier on a random sample from the training set. Bagging has the important advantage that it is effective on "unstable learning algorithms" (Breiman 1996), where small variations in the input data can cause large variations in the learned theories. A second advantage is that the bagging algorithm is highly parallel, which makes the pattern matching effective. The model parses the sentences using a trigram model, and lexical analysis is carried out. If the current token is a verb, the previous token is a noun, pronoun or preposition, and the next token is a determiner, pronoun, preposition, noun or adjective, then the verb is identified as the predicate to be retrieved.
If the previous token is in the list (Noun, Pronoun, Prep) and the next token is in the list (Det, Pro, Prep, Noun, Adj), the two sets can be related using soft-matching rules. The exceptions included in the model are: if the current token is "is" (be), retrieve the next verb; and if the next token is in the set (Conj, Adv), do not consider it for generating rules. Certain nouns, such as "aim" and "objective", can also be used for extracting predicates. Another test data set used for extracting information is collected from the NSF Research Award abstracts; in this data set, the nouns "aim" and "objective" play a major role in filling the patterns. For each extracted token, a document-frequency score is calculated as log(N)/df, where N is the total number of documents and df is the corresponding document frequency. These values are computed for every token and arranged in descending order in order to search through the entire document collection. Tokens whose values lie between 0 and 1 are considered valid: if the value is very close to 0, the condition is satisfied and the exact rule is extracted, whereas if the value exceeds 1, the next verb is retrieved as the token. If any token fails to fall between 0 and 1, the entire rule is discarded.
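The POS-context rule, the "is" (be) exception, and the document-frequency threshold described above can be sketched as follows; the tag names, the (word, tag) token format, and the base of the logarithm are assumptions made for illustration only.

# Sketch of the predicate-selection rule and exceptions described above.
# Tokens are assumed to be (word, POS) pairs; tag names are illustrative.
import math

PREV_TAGS = {"Noun", "Pronoun", "Prep"}
NEXT_TAGS = {"Det", "Pro", "Prep", "Noun", "Adj"}
SKIP_TAGS = {"Conj", "Adv"}                    # exception: generate no rule

def candidate_predicates(tagged_tokens):
    """Yield verbs whose left/right POS context matches the trigram rule."""
    for i in range(1, len(tagged_tokens) - 1):
        word, tag = tagged_tokens[i]
        prev_tag = tagged_tokens[i - 1][1]
        next_word, next_tag = tagged_tokens[i + 1]
        if tag != "Verb" or prev_tag not in PREV_TAGS:
            continue
        if next_tag in SKIP_TAGS:
            continue                           # exception: Conj/Adv follows
        if word.lower() == "is" and next_tag == "Verb":
            yield next_word                    # exception: "is" (be) -> next verb
        elif next_tag in NEXT_TAGS:
            yield word

def valid_token(n_docs, doc_freq):
    """Keep a token only if its log(N)/df score lies between 0 and 1."""
    score = math.log10(n_docs) / doc_freq      # log(N)/df as described above
    return 0 < score <= 1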
During semantic analysis, using a hidden Markov model, the system verifies whether a particular verb indicates a predicate, and the validation is carried out on documents from two different domains. The abstracts are taken from two domains, information retrieval and image processing, from www.computer.org; the verbs used for stating the aim and the conclusion are the same across domains, whereas the verbs (tokens) used for stating the methodology are not common. Although there are no explicit causal relations in the above example, we can hypothesize a simple rule of the form:
IF the current goal is G1, ... and the method used is M1, ...
THEN it is true that we can achieve the conclusion C1, ...
Data mining methods generally require terms in discovered rules to exactly match database entries. Normal variation in textual data items can therefore prevent the discovery of important regularities. Variations can arise from typographical errors, misspellings and abbreviations, as well as other sources, and are particularly pronounced in data automatically extracted from unstructured or semi-structured documents or web pages. An example of a sample filled template from which data is extracted is shown in Fig.3.
Fig.3 Sample filled template
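As a simple illustration of soft matching over such variations, the sketch below uses an edit-distance-style similarity; the similarity measure and the 0.8 threshold are assumed choices, not those used in the system.

# Soft matching of textual entries with an edit-distance-style similarity;
# the 0.8 threshold is an assumed example value.
from difflib import SequenceMatcher

def soft_match(a, b, threshold=0.8):
    """Treat two strings as matching when their similarity ratio exceeds the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(soft_match("visualisation", "visualization"))   # True despite the spelling variation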
The soft-matching rules allow non-standardized database entries to match antecedents and consequents based on relevant similarity metrics. A benefit of association-rule mining over classification-rule induction is that the consequents of rules are not predetermined, so all potential associations can be mined efficiently as part of a single process; when inducing classification rules, a separate learning step must be run for predicting each possible item. The algorithm specifications adopted for rule mining and information extraction are depicted in Fig.4 and Fig.5.
Input: D is the set of documents
Output: R is the set of rules

Function RuleMining(D)
    Determine t, a threshold value for rule validation between 0 and 1
    Create a database of labeled examples by applying information extraction to the document corpus
    For each labeled document d ∈ D do
        T = set of tokens of d
        Build a prediction rule from the set of tokens and add it to R
    For each prediction rule r ∈ R do
        Verify r on training data and validation data
        If the accuracy of r is less than t, delete r from R
    return R

Fig.4 Algorithm specification: rule mining
Input: R is the set of prediction rules
       D is the set of documents
Output: P is the set of predicates extracted

Function InformationExtraction(R, D)
    P := ∅
    For each example d ∈ D do
        Extract predicates from d using the extraction rules and add them to P
        If a rule r is the same as the currently extracted predicate
            Extract the predicted predicate and add it to P
    return P

Fig.5 Algorithm specification: information extraction
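A compact Python rendering of the control flow in Fig.4 and Fig.5 could look like the sketch below; the helpers extract_tokens, build_rule, accuracy_of and apply_rule are assumed placeholders for steps the text does not spell out, so this is an illustration rather than the authors' implementation.

# Sketch of the Fig.4 and Fig.5 procedures. The helper functions passed in as
# arguments are assumed placeholders, not the authors' implementation.

def rule_mining(documents, threshold, extract_tokens, build_rule, accuracy_of):
    """Fig.4: build one prediction rule per labeled document, keep accurate ones."""
    rules = []
    for doc in documents:
        tokens = extract_tokens(doc)          # tokens of the labeled document
        rules.append(build_rule(tokens))      # prediction rule built from the tokens
    # verify each rule on training and validation data, prune inaccurate ones
    return [r for r in rules if accuracy_of(r) >= threshold]

def information_extraction(rules, documents, apply_rule):
    """Fig.5: apply the mined rules to every document and collect the predicates."""
    predicates = []
    for doc in documents:
        for rule in rules:
            extracted = apply_rule(rule, doc)
            if extracted is not None:
                predicates.append(extracted)
    return predicates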
We mine rules by identifying occurrences of the current token preceded and followed by tokens in the specified order and by checking the threshold (an inverse document frequency between 0 and 1). Rules that satisfy the condition are added to the rule set; otherwise they are pruned. For each extracted rule, we check whether the training data matches the current token; if it matches, the rule is extracted and stored in the structured format. From 500 documents collected from www.computer.org we extracted 120 rules, and from 750 documents collected from the NSF Research Award abstracts, 200 rules were finally extracted. The accuracy of the extracted rules is measured and the results are tabulated in Fig.6.
Number of examples for rule mining | Number of exceptions | Precision | Recall
500                                | 25                   | 97.4      | 76.8
750                                | 30                   | 95.3      | 72.6

Fig.6 Precision and recall of the extracted rules
CONCLUSION
In this work, we have introduced an approach that mines rules from the extracted data to improve the recall of information extraction. However, the approach is limited by the requirement that the antecedents and consequents of mined rules, as well as the exceptions, match exactly only within a particular domain, as the exceptions are implicit to the application. The soft-matching rules maintain an association mapping between the current token and its associated tokens. We observe that 75% of the rules are pruned, since the tokens used for identifying the methodology vary across domains.
REFERENCES
[1] John Abutridy, Chris Mellish and Stuart Aitken, "Combining Information Extraction with Genetic Algorithms for Text Mining", IEEE Intelligent Systems, May/June 2004, pp. 22-30.
[2] Eugene Agichtein and Luis Gravano, "Extracting Relations from Large Plain-Text Collections", Columbia University.
[3] Raymond J. Mooney and Razvan Bunescu, "Mining Knowledge from Text Using Information Extraction", ACM SIGKDD Explorations, 7, 2005, pp. 3-10.
[4] Yiyu Yao, Fei-Yue Wang, Daniel Zeng and Jue Wang, "Rule + Exception Strategies for Security Information Analysis", IEEE Intelligent Systems, September/October 2005, pp. 52-57.
[5] José Ramón Cano, Francisco Herrera and Manuel Lozano, "Evolutionary Stratified Training Set Selection for Extracting Classification Rules with Trade-off Precision-Interpretability", Data & Knowledge Engineering, 60, 2007, pp. 90-108.
[6] Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu and Wayne Niblack, "Sentiment Analyzer: Extracting Sentiments about a Given Topic Using Natural Language Processing Techniques", IBM Almaden Research Center.
[7] Hsin-Hsi Chen and Lun-Wei Ku, "Description of a Topic Detection Algorithm on TDT3 Mandarin Text", National Taiwan University.